Hive Joins and Complex Data Type Operations

Posted by Jackson on 2017-09-06

Join Operations in Hive

Inner Join

Inner Join: returns only the rows where the join condition matches.

Left Join: left outer join; returns every row from the left table plus the matching rows from the right table, filling NULL where there is no match.

Right Join: right outer join; returns every row from the right table plus the matching rows from the left table, filling NULL where there is no match.

Full Join: full outer join; returns all rows from both tables, filling NULL where there is no match.
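A minimal sketch of the four join types. The tables t1 and t2 here are hypothetical toy tables, not part of this post's data:

-- Hypothetical toy tables:
create table t1(id int, name string);
create table t2(id int, addr string);
-- t1: (1,'a') (2,'b') (3,'c')    t2: (1,'beijing') (2,'shanghai') (4,'tianjin')

select t1.id, t1.name, t2.addr from t1 join t2 on t1.id = t2.id;              -- inner: ids 1,2
select t1.id, t1.name, t2.addr from t1 left outer join t2 on t1.id = t2.id;   -- ids 1,2,3 (addr is NULL for 3)
select t1.id, t1.name, t2.addr from t1 right outer join t2 on t1.id = t2.id;  -- ids 1,2,4 (name is NULL for 4)
select t1.id, t1.name, t2.addr from t1 full outer join t2 on t1.id = t2.id;   -- ids 1,2,3,4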

HiveServer2

Start the HiveServer2 server, then connect with beeline.

Note: run beeline from Hive's bin directory. Spark ships its own beeline, and when both are on the PATH the one configured later shadows the other, so you can end up in the wrong client.

./beeline -u jdbc:hive2://bigdata01:10000/bigdata_hive -n hadoop

View the HiveServer2 command help:

[hadoop@bigdata01 bin]$ hiveserver2 --help
19/12/19 20:21:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
usage: hiveserver2
--deregister <versionNumber> Deregister all instances of given
version from dynamic service discovery
-H,--help Print help information
--hiveconf <property=value> Use value for given property

Changing the HiveServer2 default port

Temporary override on the command line:

hiveserver2 --hiveconf hive.server2.thrift.port=10066

Permanent change in Hive's configuration file hive-site.xml:

hive.server2.thrift.port
Default Value: 10000
Added In: Hive 0.11.0 with HIVE-2935


Hive's Complex Data Types

Array Type

Data:

pk	beijing,shanghai,tianjin,hangzhou
jepson	changchun,chengdu,wuhan,beijing
create table hive_array(
name string,
location array<string>)
row format delimited fields terminated by '\t' COLLECTION ITEMS TERMINATED BY ',';

load data local inpath '/home/hadoop/data/hive_array.txt' overwrite into table hive_array;
select * from hive_array;
INFO : OK
+------------------+----------------------------------------------+--+
| hive_array.name | hive_array.location |
+------------------+----------------------------------------------+--+
| pk | ["beijing","shanghai","tianjin","hangzhou"] |
| jepson | ["changchun","chengdu","wuhan","beijing"] |
+------------------+----------------------------------------------+--+

0: jdbc:hive2://bigdata01:10000/bigdata_hive> desc hive_array;
+-----------+----------------+----------+--+
| col_name | data_type | comment |
+-----------+----------------+----------+--+
| name | string | |
| location | array<string> | |
+-----------+----------------+----------+--+

Get each employee's first work location:

select name,location[0] from hive_array;
INFO : OK
+---------+-----------+--+
| name | _c1 |
+---------+-----------+--+
| pk | beijing |
| jepson | changchun |
+---------+-----------+--+

Count how many work locations each employee has:

select name, size(location) from hive_array;
INFO : OK
+---------+------+--+
| name | _c1 |
+---------+------+--+
| pk | 4 |
| jepson | 4 |
+---------+------+--+

Find the employees who work in tianjin:

select * from hive_array where array_contains(location,'tianjin');
INFO : OK
+------------------+----------------------------------------------+--+
| hive_array.name | hive_array.location |
+------------------+----------------------------------------------+--+
| pk | ["beijing","shanghai","tianjin","hangzhou"] |
+------------------+----------------------------------------------+--+

Map Type

Data:

1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28
2,lisi,father:mayun#mother:huangyi#brother:guanyu,22
3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29
4,mayun,father:mayongzhen#mother:angelababy,26

Create table statement:

create table hive_map(
id int,
name string,
members map<string,string>,
age int
) row format delimited fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';

Load the data:

load data local inpath '/home/hadoop/data/hive_map.txt' into table hive_map;
+--------------+----------------+--------------------------------------------------------------------+---------------+
| hive_map.id | hive_map.name | hive_map.members | hive_map.age |
+--------------+----------------+--------------------------------------------------------------------+---------------+
| 1 | zhangsan | {"father":"xiaoming","mother":"xiaohuang","brother":"xiaoxu"} | 28 |
| 2 | lisi | {"father":"mayun","mother":"huangyi","brother":"guanyu"} | 22 |
| 3 | wangwu | {"father":"wangjianlin","mother":"ruhua","sister":"jingtian"} | 29 |
| 4 | mayun | {"father":"mayongzhen","mother":"angelababy"} | 26 |
+--------------+----------------+--------------------------------------------------------------------+---------------+

Requirement: get each person's father and mother:

select name,members['father'] as father,members['mother'] as mother from hive_map;
INFO : OK
+-----------+--------------+-------------+--+
| name | father | mother |
+-----------+--------------+-------------+--+
| zhangsan | xiaoming | xiaohuang |
| lisi | mayun | huangyi |
| wangwu | wangjianlin | ruhua |
| mayun | mayongzhen | angelababy |
+-----------+--------------+-------------+--+

Get each person's relationship types, i.e. the map keys:

select id,name,map_keys(members) as members from hive_map;

INFO : OK
+-----+-----------+--------------------------------+--+
| id | name | members |
+-----+-----------+--------------------------------+--+
| 1 | zhangsan | ["father","mother","brother"] |
| 2 | lisi | ["father","mother","brother"] |
| 3 | wangwu | ["father","mother","sister"] |
| 4 | mayun | ["father","mother"] |
+-----+-----------+--------------------------------+--+

Get the names of each person's relatives, i.e. all the map values:

select id,name,map_values(members) as members from hive_map;

INFO : OK
+-----+-----------+-------------------------------------+--+
| id | name | members |
+-----+-----------+-------------------------------------+--+
| 1 | zhangsan | ["xiaoming","xiaohuang","xiaoxu"] |
| 2 | lisi | ["mayun","huangyi","guanyu"] |
| 3 | wangwu | ["wangjianlin","ruhua","jingtian"] |
| 4 | mayun | ["mayongzhen","angelababy"] |
+-----+-----------+-------------------------------------+--+

Count how many relationships each person has:

select id,name,size(members) from hive_map;

INFO : OK
+-----+-----------+------+--+
| id | name | _c2 |
+-----+-----------+------+--+
| 1 | zhangsan | 3 |
| 2 | lisi | 3 |
| 3 | wangwu | 3 |
| 4 | mayun | 2 |
+-----+-----------+------+--+

Find the people who have a brother, and who that brother is.

Reuse array_contains from the array functions: first turn the map keys into an array with map_keys, then operate on that array.

select id,name,members['brother'] brother from hive_map where array_contains(map_keys(members),'brother');
+-----+-----------+----------+--+
| id | name | brother |
+-----+-----------+----------+--+
| 1 | zhangsan | xiaoxu |
| 2 | lisi | guanyu |
+-----+-----------+----------+--+

Struct Type

The fields inside a struct can have different data types.

structs: STRUCT<col_name : data_type [COMMENT col_comment], …>

Data:

[root@bigdata01 data]# cat hive_struct.txt 
192.168.1.1#zhangsan:40
192.168.1.2#lisi:50
192.168.1.3#wangwu:60
192.168.1.4#zhaoliu:70

Create the table:

create table hive_struct(
id string,
info struct<name:string,age:int>
) row format delimited fields terminated by '#'
collection items terminated by ':' ;

Load the data:

load data local inpath '/home/hadoop/data/hive_struct.txt' into table hive_struct;


INFO : OK
+-----------------+-------------------------------+--+
| hive_struct.id | hive_struct.info |
+-----------------+-------------------------------+--+
| 192.168.1.1 | {"name":"zhangsan","age":40} |
| 192.168.1.2 | {"name":"lisi","age":50} |
| 192.168.1.3 | {"name":"wangwu","age":60} |
| 192.168.1.4 | {"name":"zhaoliu","age":70} |
+-----------------+-------------------------------+--+

Query the data:

select id,info.name,info.age from hive_struct;
INFO : OK
+--------------+-----------+------+--+
| id | name | age |
+--------------+-----------+------+--+
| 192.168.1.1 | zhangsan | 40 |
| 192.168.1.2 | lisi | 50 |
| 192.168.1.3 | wangwu | 60 |
| 192.168.1.4 | zhaoliu | 70 |
+--------------+-----------+------+--+

The click_log table:

create table click_log(
cookie_id string,
ad_id string,
time string
) row format delimited fields terminated by '\t';

Load the data:

load data local inpath '/home/hadoop/data/click_log.txt' into table click_log;

+----------------------+------------------+-----------------------------+--+
| click_log.cookie_id | click_log.ad_id | click_log.time |
+----------------------+------------------+-----------------------------+--+
| 11 | ad_101 | 2014-05-01 06:01:12.334+01 |
| 22 | ad_102 | 2014-05-01 07:28:12.342+01 |
| 33 | ad_103 | 2014-05-01 07:50:12.33+01 |
| 11 | ad_104 | 2014-05-01 09:27:12.33+01 |
| 22 | ad_103 | 2014-05-01 09:03:12.324+01 |
| 33 | ad_102 | 2014-05-02 19:10:12.343+01 |
| 11 | ad_101 | 2014-05-02 09:07:12.344+01 |
| 35 | ad_105 | 2014-05-03 11:07:12.339+01 |
| 22 | ad_104 | 2014-05-03 12:59:12.743+01 |
| 77 | ad_103 | 2014-05-03 18:04:12.355+01 |
| 99 | ad_102 | 2014-05-04 00:36:39.713+01 |
| 33 | ad_101 | 2014-05-04 19:10:12.343+01 |
| 11 | ad_101 | 2014-05-05 09:07:12.344+01 |
| 35 | ad_102 | 2014-05-05 11:07:12.339+01 |
| 22 | ad_103 | 2014-05-05 12:59:12.743+01 |
| 77 | ad_104 | 2014-05-05 18:04:12.355+01 |
| 99 | ad_105 | 2014-05-05 20:36:39.713+01 |
+----------------------+------------------+-----------------------------+--+

The ad_list table:

create table ad_list(
ad_id string,
url string,
catalogs string
) row format delimited fields terminated by '\t';

Load the data:

load data local inpath '/home/hadoop/data/ad_list.txt' into table ad_list;

INFO : OK
+----------------+------------------------+--------------------------------------+--+
| ad_list.ad_id | ad_list.url | ad_list.catalogs |
+----------------+------------------------+--------------------------------------+--+
| ad_101 | http://www.google.com | catalog8|catalog1 |
| ad_102 | http://www.sohu.com | catalog6|catalog3 |
| ad_103 | http://www.baidu.com | catalog7 |
| ad_104 | http://www.qq.com | catalog5|catalog1|catalog4|catalog9 |
| ad_105 | http://sina.com | NULL |
+----------------+------------------------+--------------------------------------+--+

Analyzing click_log together with ad_list

select * from click_log;
+----------------------+------------------+-----------------------------+--+
| click_log.cookie_id | click_log.ad_id | click_log.time |
+----------------------+------------------+-----------------------------+--+
| 11 | ad_101 | 2014-05-01 06:01:12.334+01 |
| 22 | ad_102 | 2014-05-01 07:28:12.342+01 |
| 33 | ad_103 | 2014-05-01 07:50:12.33+01 |
| 11 | ad_104 | 2014-05-01 09:27:12.33+01 |
| 22 | ad_103 | 2014-05-01 09:03:12.324+01 |
| 33 | ad_102 | 2014-05-02 19:10:12.343+01 |
| 11 | ad_101 | 2014-05-02 09:07:12.344+01 |
| 35 | ad_105 | 2014-05-03 11:07:12.339+01 |
| 22 | ad_104 | 2014-05-03 12:59:12.743+01 |
| 77 | ad_103 | 2014-05-03 18:04:12.355+01 |
| 99 | ad_102 | 2014-05-04 00:36:39.713+01 |
| 33 | ad_101 | 2014-05-04 19:10:12.343+01 |
| 11 | ad_101 | 2014-05-05 09:07:12.344+01 |
| 35 | ad_102 | 2014-05-05 11:07:12.339+01 |
| 22 | ad_103 | 2014-05-05 12:59:12.743+01 |
| 77 | ad_104 | 2014-05-05 18:04:12.355+01 |
| 99 | ad_105 | 2014-05-05 20:36:39.713+01 |
+----------------------+------------------+-----------------------------+--+

Get the distinct ad_id values each cookie visited:

select cookie_id,collect_set(ad_id) from click_log group by cookie_id;

INFO : OK
+------------+-------------------------------+--+
| cookie_id | _c1 |
+------------+-------------------------------+--+
| 11 | ["ad_101","ad_104"] |
| 22 | ["ad_102","ad_103","ad_104"] |
| 33 | ["ad_103","ad_102","ad_101"] |
| 35 | ["ad_105","ad_102"] |
| 77 | ["ad_103","ad_104"] |
| 99 | ["ad_102","ad_105"] |
+------------+-------------------------------+--+

If you don't want deduplication, use the collect_list() function instead:

select cookie_id,collect_list(ad_id) from click_log group by cookie_id;

+------------+----------------------------------------+--+
| cookie_id | _c1 |
+------------+----------------------------------------+--+
| 11 | ["ad_101","ad_104","ad_101","ad_101"] |
| 22 | ["ad_102","ad_103","ad_104","ad_103"] |
| 33 | ["ad_103","ad_102","ad_101"] |
| 35 | ["ad_105","ad_102"] |
| 77 | ["ad_103","ad_104"] |
| 99 | ["ad_102","ad_105"] |
+------------+----------------------------------------+--+

Count each cookie's visits per ad:

select cookie_id, ad_id, count(1) from click_log group by cookie_id, ad_id;

+------------+---------+------+--+
| cookie_id | ad_id | _c2 |
+------------+---------+------+--+
| 11 | ad_101 | 3 |
| 11 | ad_104 | 1 |
| 22 | ad_102 | 1 |
| 22 | ad_103 | 2 |
| 22 | ad_104 | 1 |
| 33 | ad_101 | 1 |
| 33 | ad_102 | 1 |
| 33 | ad_103 | 1 |
| 35 | ad_102 | 1 |
| 35 | ad_105 | 1 |
| 77 | ad_103 | 1 |
| 77 | ad_104 | 1 |
| 99 | ad_102 | 1 |
| 99 | ad_105 | 1 |
+------------+---------+------+--+

On top of that, add the category: join against ad_list to pull in the catalogs for each ad_id:

select click.cookie_id,click.ad_id,click.amount,ad_list.catalogs
from
(select cookie_id,ad_id ,count(1) amount from click_log group by cookie_id,ad_id) click
join ad_list
on ad_list.ad_id = click.ad_id;

+------------------+--------------+---------------+--------------------------------------+--+
| click.cookie_id | click.ad_id | click.amount | ad_list.catalogs |
+------------------+--------------+---------------+--------------------------------------+--+
| 11 | ad_101 | 3 | catalog8|catalog1 |
| 11 | ad_104 | 1 | catalog5|catalog1|catalog4|catalog9 |
| 22 | ad_102 | 1 | catalog6|catalog3 |
| 22 | ad_103 | 2 | catalog7 |
| 22 | ad_104 | 1 | catalog5|catalog1|catalog4|catalog9 |
| 33 | ad_101 | 1 | catalog8|catalog1 |
| 33 | ad_102 | 1 | catalog6|catalog3 |
| 33 | ad_103 | 1 | catalog7 |
| 35 | ad_102 | 1 | catalog6|catalog3 |
| 35 | ad_105 | 1 | NULL |
| 77 | ad_103 | 1 | catalog7 |
| 77 | ad_104 | 1 | catalog5|catalog1|catalog4|catalog9 |
| 99 | ad_102 | 1 | catalog6|catalog3 |
| 99 | ad_105 | 1 | NULL |
+------------------+--------------+---------------+--------------------------------------+--+

Exploding the catalogs column into rows

ad_101 | catalog8|catalog1 becomes:

ad_101 |catalog1
ad_101 |catalog8

select ad_id,catalog from ad_list lateral view outer explode(split(catalogs,'\\|')) t as catalog;
+---------+-----------+--+
| ad_id | catalog |
+---------+-----------+--+
| ad_101 | catalog8 |
| ad_101 | catalog1 |
| ad_102 | catalog6 |
| ad_102 | catalog3 |
| ad_103 | catalog7 |
| ad_104 | catalog5 |
| ad_104 | catalog1 |
| ad_104 | catalog4 |
| ad_104 | catalog9 |
| ad_105 | NULL |
+---------+-----------+--+

If you drop the outer keyword from the query above, the trailing NULL row disappears as well.
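The same query without outer, as a quick check; ad_105 has NULL catalogs, so explode() produces nothing for it and the row is dropped entirely:

select ad_id,catalog from ad_list lateral view explode(split(catalogs,'\\|')) t as catalog;
-- same result as above, minus the "| ad_105 | NULL |" row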

Requirement: sort the values inside the catalogs column below:

+---------+--------------------------------------+--+
| ad_id | catalogs |
+---------+--------------------------------------+--+
| ad_101 | catalog8|catalog1 |
| ad_102 | catalog6|catalog3 |
| ad_103 | catalog7 |
| ad_104 | catalog5|catalog1|catalog4|catalog9 |
| ad_105 | NULL |
+---------+--------------------------------------+--+

Convert the column to an array, then sort it.
Reload the table with catalogs as an array:

create table ad_list_2(ad_id string,url string,catalogs array<string>)row format 
delimited fields terminated by '\t' COLLECTION ITEMS TERMINATED BY '|';

Load the data:

load data local inpath '/home/hadoop/data/ad_list.txt' into table ad_list_2;


+------------------+------------------------+------------------------------------------------+--+
| ad_list_2.ad_id | ad_list_2.url | ad_list_2.catalogs |
+------------------+------------------------+------------------------------------------------+--+
| ad_101 | http://www.google.com | ["catalog8","catalog1"] |
| ad_102 | http://www.sohu.com | ["catalog6","catalog3"] |
| ad_103 | http://www.baidu.com | ["catalog7"] |
| ad_104 | http://www.qq.com | ["catalog5","catalog1","catalog4","catalog9"] |
| ad_105 | http://sina.com | NULL |
+------------------+------------------------+------------------------------------------------+--+

Result:

select ad_id,url,sort_array(catalogs) from ad_list_2;
INFO : OK
+---------+------------------------+------------------------------------------------+--+
| ad_id | url | _c2 |
+---------+------------------------+------------------------------------------------+--+
| ad_101 | http://www.google.com | ["catalog1","catalog8"] |
| ad_102 | http://www.sohu.com | ["catalog3","catalog6"] |
| ad_103 | http://www.baidu.com | ["catalog7"] |
| ad_104 | http://www.qq.com | ["catalog1","catalog4","catalog5","catalog9"] |
| ad_105 | http://sina.com | NULL |
+---------+------------------------+------------------------------------------------+--+

Date and Time Functions

Check the current time:
select current_timestamp from dual;
INFO : OK

+--------------------------+
| _c0 |
+--------------------------+
| 2017-09-06 10:56:56.992 |
+--------------------------+

Get the number of seconds since 1970-01-01 (the Unix epoch):

select unix_timestamp() from dual;

+-------------+
| _c0 |
+-------------+
| 1504666616 |
+-------------+

unix_timestamp converts a time to the number of seconds since 1970-01-01:

desc function extended unix_timestamp;
INFO : OK
+--------------------------------------------------------------------------------------------+
| tab_name |
+--------------------------------------------------------------------------------------------+
| unix_timestamp(date[, pattern]) - Converts the time to a number |
| Converts the specified time to number of seconds since 1970-01-01. The unix_timestamp(void) overload is deprecated, use current_timestamp. |
+--------------------------------------------------------------------------------------------+
select unix_timestamp('2017-09-06 10:56:56') from dual;
INFO : OK
+-------------+
| _c0 |
+-------------+
| 1504666616 |
+-------------+
select unix_timestamp('20170201 20:12:12','yyyyMMdd HH:mm:ss') from dual;

INFO : OK
+-------------+
| _c0 |
+-------------+
| 1485951132 |
+-------------+

from_unixtime converts a given number of epoch seconds into a date string in the specified format:

desc function extended from_unixtime;
+--------------------------------------------------------------------------------+
| tab_name |
+--------------------------------------------------------------------------------+
| from_unixtime(unix_time, format) - returns unix_time in the specified format |
| Example: |
| > SELECT from_unixtime(0, 'yyyy-MM-dd HH:mm:ss') FROM src LIMIT 1; |
| '1970-01-01 00:00:00' |
+--------------------------------------------------------------------------------+
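A quick round-trip sketch using the epoch value from the unix_timestamp example above (the output assumes the same session time zone):

select from_unixtime(1504666616, 'yyyy-MM-dd HH:mm:ss') from dual;
-- 2017-09-06 10:56:56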

String to date: to_date

desc function extended to_date;	
INFO : OK
+--------------------------------------------------------------------------------+
| tab_name |
+--------------------------------------------------------------------------------+
| to_date(expr) - Extracts the date part of the date or datetime expression expr |
| Example: |
| > SELECT to_date('2009-07-30 04:17:52') FROM src LIMIT 1; |
| '2009-07-30' |
+--------------------------------------------------------------------------------+
select to_date('2022-12-12 12:45:12') from dual;
INFO : OK
+-------------+
| _c0 |
+-------------+
| 2022-12-12 |
+-------------+
Year:
select year('2012-12-22') from dual;
INFO : OK
+-------+
| _c0 |
+-------+
| 2012 |
+-------+

Month:
select month('2022-12-12 12:45:12') from dual;
INFO : OK
+------+
| _c0 |
+------+
| 12 |
+------+
The other parts work the same way: day(), hour(), minute(), second().
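A minimal sketch of those extractors on the same timestamp (outputs follow the standard semantics):

select day('2022-12-12 12:45:12'), hour('2022-12-12 12:45:12'), minute('2022-12-12 12:45:12'), second('2022-12-12 12:45:12') from dual;
-- 12   12   45   12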

Date arithmetic: date_add & date_sub

date_add(start_date, num_days) - Returns the date that is num_days after start_date.
date_sub(start_date, num_days) - Returns the date that is num_days before start_date.
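For example (a sketch; the dates are computed by hand from the definitions above):

select date_add('2017-09-06', 7), date_sub('2017-09-06', 7) from dual;
-- 2017-09-13   2017-08-30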

Type conversion:

cast(value as type)
select cast('10' as int) from dual;
select cast(current_timestamp as date) from dual;

Note: a cast that cannot be performed returns NULL.

binary can only be cast directly to string.
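A small sketch of both behaviors (the outputs assume standard Hive cast semantics):

select cast('10' as int), cast('abc' as int) from dual;
-- 10   NULL   (the second cast cannot be performed, so it returns NULL)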


Math Functions

round()      round half up (optionally to d decimal places)
ceil()       round up
ceiling()    alias of ceil()
floor()      round down
abs()        absolute value
least()      smallest of the arguments
greatest()   largest of the arguments
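A one-line sketch covering all of them (outputs follow the standard semantics):

select round(3.456, 2), ceil(3.1), floor(3.9), abs(-5), least(3, 1, 2), greatest(3, 1, 2) from dual;
-- 3.46   4   3   5   1   3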

String functions:

substr
substring
concat
concat_ws
select concat_ws(".","192","168","52","45") from dual;
INFO : OK
+----------------+
| _c0 |
+----------------+
| 192.168.52.45 |
+----------------+

split
select split("192.168.52.100","\\.") from dual;
INFO : OK
+---------------------------+
| _c0 |
+---------------------------+
| ["192","168","52","100"] |
+---------------------------+
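substr/substring and concat from the list above have no examples, so here is a sketch (results follow the standard semantics):

select substr('hive_rocks', 1, 4), concat('192', '.', '168') from dual;
-- hive   192.168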

JSON String Operations

create table rating(json string);
-- after loading the raw JSON lines:
select * from rating;
INFO : OK
+-------------------------------------------------------------+
| rating.json |
+-------------------------------------------------------------+
| {"movie":"1193","rate":"5","time":"978300760","userid":"1"} |
| {"movie":"661","rate":"3","time":"978300761","userid":"1"} |
| {"movie":"914","rate":"3","time":"978300762","userid":"1"} |
| {"movie":"325","rate":"4","time":"978300763","userid":"1"} |
| {"movie":"1443","rate":"3","time":"978300760","userid":"1"} |
| {"movie":"1893","rate":"2","time":"978301760","userid":"1"} |
| {"movie":"1153","rate":"3","time":"978302760","userid":"1"} |
| {"movie":"1893","rate":"1","time":"978303760","userid":"1"} |
+-------------------------------------------------------------+

The json_tuple function

desc function extended json_tuple;

+---------------------------------------------------------------------------------------------+
| tab_name |
+---------------------------------------------------------------------------------------------+
| json_tuple(jsonStr, p1, p2, ..., pn) - like get_json_object, but it takes multiple names and return a tuple. All the input parameters and output column types are string. |
+---------------------------------------------------------------------------------------------+
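For a single field, the get_json_object function mentioned in that description also works; it takes a JSONPath-style expression (a sketch):

select get_json_object(json, '$.movie') from rating;
-- one string column: 1193, 661, 914, ...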

Usage example:

select json_tuple(json,'movie','rate','time','userid') as (movie,rate,time,userid) from rating;

INFO : OK
+--------+-------+------------+---------+
| movie | rate | time | userid |
+--------+-------+------------+---------+
| 1193 | 5 | 978300760 | 1 |
| 661 | 3 | 978300761 | 1 |
| 914 | 3 | 978300762 | 1 |
| 325 | 4 | 978300763 | 1 |
| 1443 | 3 | 978300760 | 1 |
| 1893 | 2 | 978301760 | 1 |
| 1153 | 3 | 978302760 | 1 |
| 1893 | 1 | 978303760 | 1 |
+--------+-------+------------+---------+
select userid,movie,rate,year(from_unixtime(cast(time as bigint))) as date from (select json_tuple(json,'movie','rate','time','userid') as (movie,rate,time,userid) from rating) t;
+---------+--------+-------+-------+
| userid | movie | rate | date |
+---------+--------+-------+-------+
| 1 | 1193 | 5 | 2001 |
| 1 | 661 | 3 | 2001 |
| 1 | 914 | 3 | 2001 |
| 1 | 325 | 4 | 2001 |
| 1 | 1443 | 3 | 2001 |
| 1 | 1893 | 2 | 2001 |
| 1 | 1153 | 3 | 2001 |
| 1 | 1893 | 1 | 2001 |
+---------+--------+-------+-------+
-- This first version fails: json_tuple returns strings, and strings do not implicitly convert to the bigint that from_unixtime expects
select userid,movie,rate,to_date(from_unixtime(time, 'yyyy-MM-dd HH:mm:ss')) as date from (select json_tuple(json,'movie','rate','time','userid') as (movie,rate,time,userid) from rating) t;

-- Cast time to bigint first:
select userid,movie,rate,to_date(from_unixtime(cast(time as bigint), 'yyyy-MM-dd HH:mm:ss')) as datess from (select json_tuple(json,'movie','rate','time','userid') as (movie,rate,time,userid) from rating) t;
+---------+--------+-------+-------------+
| userid | movie | rate | datess |
+---------+--------+-------+-------------+
| 1 | 1193 | 5 | 2001-01-01 |
| 1 | 661 | 3 | 2001-01-01 |
| 1 | 914 | 3 | 2001-01-01 |
| 1 | 325 | 4 | 2001-01-01 |
| 1 | 1443 | 3 | 2001-01-01 |
| 1 | 1893 | 2 | 2001-01-01 |
| 1 | 1153 | 3 | 2001-01-01 |
| 1 | 1893 | 1 | 2001-01-01 |
+---------+--------+-------+-------------+

The URL parsing function parse_url_tuple

desc function extended parse_url_tuple;
+------------------------------------------------------------------------------------------------------------------------+
| tab_name |
+------------------------------------------------------------------------------------------------------------------------+
| parse_url_tuple(url, partname1, partname2, ..., partnameN) - extracts N (N>=1) parts from a URL. |
| It takes a URL and one or multiple partnames, and returns a tuple. All the input parameters and output column types are string. |
| Partname: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO, QUERY:<KEY_NAME> |
| Note: Partnames are case-sensitive, and should not contain unnecessary white spaces. |
| Example: |
| > SELECT b.* FROM src LATERAL VIEW parse_url_tuple(fullurl, 'HOST', 'PATH', 'QUERY', 'QUERY:id') b as host, path, query, query_id LIMIT 1; |
| > SELECT parse_url_tuple(a.fullurl, 'HOST', 'PATH', 'QUERY', 'REF', 'PROTOCOL', 'FILE', 'AUTHORITY', 'USERINFO', 'QUERY:k1') as (ho, pa, qu, re, pr, fi, au, us, qk1) from src a; |
+------------------------------------------------------------------------------------------------------------------------+
select parse_url_tuple("http://www.baidu.com/big/webpage?cookieid=10","HOST","PATH","QUERY") from dual;

INFO : OK
+----------------+---------------+--------------+
| c0 | c1 | c2 |
+----------------+---------------+--------------+
| www.baidu.com | /big/webpage | cookieid=10 |
+----------------+---------------+--------------+
select parse_url_tuple("http://www.baidu.com/big/webpage?cookieid=10","HOST","PATH","QUERY","QUERY:cookieid") from dual;

INFO : OK
+----------------+---------------+--------------+-----+
| c0 | c1 | c2 | c3 |
+----------------+---------------+--------------+-----+
| www.baidu.com | /big/webpage | cookieid=10 | 10 |
+----------------+---------------+--------------+-----+

Other Functions

Checking whether a value is NULL

select ename,job,comm,isnull(comm) isnull,isnotnull(comm) isnotnull from emp;

INFO : OK
+---------+------------+---------+---------+------------+
| ename | job | comm | isnull | isnotnull |
+---------+------------+---------+---------+------------+
| SMITH | CLERK | NULL | true | false |
| ALLEN | SALESMAN | 300.0 | false | true |
| WARD | SALESMAN | 500.0 | false | true |
| JONES | MANAGER | NULL | true | false |
| MARTIN | SALESMAN | 1400.0 | false | true |
| BLAKE | MANAGER | NULL | true | false |
| CLARK | MANAGER | NULL | true | false |
| SCOTT | ANALYST | NULL | true | false |
| KING | PRESIDENT | NULL | true | false |
| TURNER | SALESMAN | 0.0 | false | true |
| ADAMS | CLERK | NULL | true | false |
| JAMES | CLERK | NULL | true | false |
| FORD | ANALYST | NULL | true | false |
| MILLER | CLERK | NULL | true | false |
| HIVE | PROGRAM | NULL | true | false |
+---------+------------+---------+---------+------------+

The assert_true function

desc function extended assert_true;
INFO : OK
+----------------------------------------------------+
| tab_name |
+----------------------------------------------------+
| assert_true(condition) - Throw an exception if 'condition' is not true. |
| Example: |
| > SELECT assert_true(x >= 0) FROM src LIMIT 1; |
| NULL |
+----------------------------------------------------+

The elt() function

desc function extended elt;

INFO : OK
+----------------------------------------------------+
| tab_name |
+----------------------------------------------------+
| elt(n, str1, str2, ...) - returns the n-th string |
| Example: |
| > SELECT elt(1, 'face', 'book') FROM src LIMIT 1;|
| 'face' |
+----------------------------------------------------+
select elt(2,"hello","hive","spark") from dual;
INFO : OK
+-------+
| _c0 |
+-------+
| hive |
+-------+

Handling NULLs with NVL

desc function extended nvl;

INFO : OK
+----------------------------------------------------+
| tab_name |
+----------------------------------------------------+
| nvl(value,default_value) - Returns default value if value is null else returns value |
| Example: |
| > SELECT nvl(null,'bla') FROM src LIMIT 1; |
| bla |
+----------------------------------------------------+
select ename,comm,nvl(comm,'sss') from emp;
INFO : OK
+---------+---------+---------+
| ename | comm | _c2 |
+---------+---------+---------+
| SMITH | NULL | sss |
| ALLEN | 300.0 | 300.0 |
| WARD | 500.0 | 500.0 |
| JONES | NULL | sss |
| MARTIN | 1400.0 | 1400.0 |
| BLAKE | NULL | sss |
| CLARK | NULL | sss |
| SCOTT | NULL | sss |
| KING | NULL | sss |
| TURNER | 0.0 | 0.0 |
| ADAMS | NULL | sss |
| JAMES | NULL | sss |
| FORD | NULL | sss |
| MILLER | NULL | sss |
| HIVE | NULL | sss |
+---------+---------+---------+

Summary

Hive's Array type:
When creating the table, declare the field as an array: location array<string>
Specify the separator between array items: COLLECTION ITEMS TERMINATED BY ','

Retrieving data:
Index into the field directly by name: location[0]
Get the number of elements in the array: size(location)
Test whether the array contains a value: array_contains(location,'shanghai')

Hive's Map type:
Create table: declare the Map type: members map<string,string>
Specify the separator between entries and between key and value:
collection items terminated by '#'
map keys terminated by ':';

Retrieving data:
Get the value for a given key: members['father']
Get all the keys: map_keys(members)
Get all the values: map_values(members)
Count the key-value pairs: size(members)

Hive's Struct type:
When creating the table, declare the field as a struct: info struct<name:string,age:int>
Specify the separator between struct members: COLLECTION ITEMS TERMINATED BY ':'
Retrieve data with dot notation on the field name, e.g. info.age

Column-to-rows function:
lateral view outer explode(split(catalogs,'\\|')) t as catalog
Sorting Array data:
select ad_id,url,sort_array(catalogs) from ad_list_2;